Phase 0 fingerprinting, R8-resistant extraction, Ktor/Apollo/Koin/HMAC patterns#16
Open
tajchert wants to merge 8 commits intoSimoneAvogadro:masterfrom
Open
Phase 0 fingerprinting, R8-resistant extraction, Ktor/Apollo/Koin/HMAC patterns#16tajchert wants to merge 8 commits intoSimoneAvogadro:masterfrom
tajchert wants to merge 8 commits intoSimoneAvogadro:masterfrom
Conversation
Decompiling Java is wasted effort for Flutter, React Native, Cordova/ Capacitor, and Xamarin apps — their code lives in libapp.so, the JS bundle, assets/www/, or .NET DLLs respectively. The previous workflow jumped straight to Phase 1 (install deps) and Phase 2 (decompile), so the agent had no way to know which path to take until after a full jadx run. The new fingerprint.sh inspects an APK/XAPK in seconds and reports: * Detected mobile framework with the file marker that triggered it * HTTP stack hints (Retrofit, OkHttp, Ktor, Apollo, Volley) via DEX string scanning — survives R8 obfuscation * DI and serialization libraries * Obfuscation level estimate * Notable third-party SDKs found in assets/ and DEX * Consolidated native libraries across base + split APKs (split bundles often place .so files only in config.<abi>.apk) * A framework-specific recommendation for the next step SKILL.md documents this as Phase 0 and explicitly tells the agent to stop and switch tooling if the app is non-native. PowerShell port (fingerprint.ps1) intentionally not included — happy to add if needed; behavior is straightforward to mirror.
R8 obfuscates JVM symbols but cannot strip the Kotlin metadata strings —
the Kotlin runtime needs them at runtime for reflection, coroutines, and
data-class features. The original FQNs leak through:
* @DebugMetadata(c = "<real.fqn>") emitted for every coroutine
SuspendLambda (~ every suspend function in modern apps)
* @metadata(d2 = {"L<real/fqn>;"}) on every Kotlin class
Add scripts/recover-kotlin-names.sh that walks decompiled sources, mines
both annotations, and writes an obf -> real mapping (TSV + JSON + per-real-
package index). On a real-world Kotlin app this recovers ~100 % of
*Repository / *ViewModel / *UseCase / *Impl classes — exactly the classes
worth reading.
Add scripts/lookup-name.sh as a CLI over the mapping with four modes:
search by real-name substring, resolve obf -> real, list a real package,
and an annotated `--grep` that suffixes every hit with the owning real
class. This is a strict upgrade over plain grep against decompiled sources.
Replace the misleading 'use --deobf' tip in call-flow-analysis.md with a
pointer to this technique. --deobf only renames symbols with synthetic
placeholders; metadata recovery returns actual developer-written names.
Document the technique, expected recovery rates, and limitations in
references/kotlin-name-recovery.md, and reference it from SKILL.md as
optional Phase 3.5 (only when Phase 0 reports an obfuscated Kotlin app).
The previous find-api-calls.sh covered only Retrofit, OkHttp, and Volley.
Modern Kotlin and KMP apps increasingly ship Ktor as their HTTP client
(used by ~25 % of new Kotlin apps as of 2025), and many product apps use
Apollo Kotlin for GraphQL. Both produced zero hits with the old patterns.
Add two new modes to find-api-calls.sh:
--ktor Ktor client calls (client.get/post/...), HttpRequestBuilder,
defaultRequest blocks, and the Auth bearer DSL
(BearerTokens / loadTokens / refreshTokens)
--apollo ApolloClient, .serverUrl(), HttpNetworkTransport, and
.query/.mutation/.subscription operation calls
Document both in references/api-extraction-patterns.md with example
post-decompile snippets and a note on R8 obfuscation: Ktor call sites
get inlined to obfuscated method calls, but the path string literals
and Ktor library symbols (BearerTokens, URLProtocol, etc.) survive,
so library-internal patterns still work as anchors.
When R8 inlines call sites — client.get("/api/users") becomes
a.b(c, "/api/users") — the existing framework-specific patterns find
nothing, but the path string literal itself is never obfuscated. This
single observation is the most useful endpoint-extraction technique on
heavily shrunk apps; the existing --urls mode only catches full
"https://..." URLs, missing every relative path.
Add a --paths mode that greps for quoted strings matching either:
* an absolute path with at least two slash-separated segments, or
* a relative path beginning with a known API root keyword
(api, v1/v2/v3, graphql, users, auth, profile, cart, order, ...)
with a {0,8}-segment cap and a small denylist for MIME types and system
paths (image/png, /proc/, /sys/, /dev/, etc.) which would otherwise pollute
results.
The output is a deduplicated inventory followed by the full call-site
list. On a real-world Kotlin/Ktor app this produced ~240 distinct API
paths in one shot — paths that the Retrofit/OkHttp/Ktor patterns missed
entirely because every call was inlined. This is the recommended first
extraction step on any obfuscated app.
Document the regex and rationale in references/api-extraction-patterns.md.
The previous --urls mode was a plain grep for "https?://..." which on a
real APK produced thousands of lines, half of them junk strings extracted
from Kotlin stdlib's compression dictionary ("http://An Introduction to..."
fragments) and the other half SDK URLs (Google, Firebase, AppsFlyer,
Datadog, Sentry, ...) that the analyst is not looking for. The signal —
first-party backend hosts — was buried.
Two changes:
1. Strict URL regex: hostname must have at least one dot and end in a 2+
letter TLD, with no whitespace / angle brackets / non-printables in the
path. This eliminates the dictionary-fragment noise.
2. Bucket the surviving URLs into "likely first-party" vs "third-party"
using references/third_party_hosts.txt — a curated denylist of
~80 patterns covering Google/Firebase/Apple/Microsoft/Adobe, attribution
and observability vendors (AppsFlyer, Datadog, Sentry, Bugsnag, ...),
payments (Stripe, PayU, Adyen, ...), support/chat SDKs, CAs, and
standards namespaces (w3.org, etc.).
The new output starts with a frequency-sorted list of likely first-party
hosts — which is the artifact every reverse-engineer wants on the first
page — followed by the collapsed third-party list and the full URL set
for first-party hosts only.
The denylist is a sidecar text file (one regex per line) so users can
extend or override it without editing the script.
Two gaps in the previous coverage:
1. Koin was not mentioned anywhere — Hilt/Dagger got a full section in
call-flow-analysis.md but Koin (the dominant DI in KMP and a large
share of Kotlin-only Android apps) had zero patterns. Add a Koin
subsection with the runtime-DSL patterns (module {}, single<>,
factory<>, viewModel<>, by inject, by viewModel) plus the practical
trick for resolving an interface to its impl after R8 obfuscation:
intersect "files that import org.koin.core.module" with "files that
reference the interface name".
2. The --auth mode caught Bearer / API-key / OAuth header patterns but
missed HMAC and other request-signing schemes. A hardcoded HMAC
secret embedded in an APK is a security finding worth surfacing —
the same kind of authority the user gets is the same authority a
decompiler grants to anyone. Add patterns for:
* JCA primitives: HmacSHA{1,256,512}, Mac.getInstance(...),
SecretKeySpec(...), Signature.getInstance(...)
* Header conventions: X-Signature, X-Hmac, X-Amz-Signature,
X-Client-Authorization, AWS4-HMAC, signRequest(), signaturev2/3
* Likely secret-bearing identifiers: app_secret, client_secret,
signing_key, hmac_secret, consumer_secret, private_key
* Ktor BearerTokens / loadTokens / refreshTokens DSL
These survive R8 because the JCA and Ktor APIs are public and not
shrunk. On a real-world app with a homegrown HMAC scheme they pinpoint
the signing class and its hardcoded key directly.
Without an overview the script dumps thousands of file:line: matches across many sections, leaving the reader to figure out which framework even applies. A short summary at the top makes the rest of the output actionable. The summary counts hits per framework / DI / auth-signal category in a single grep pass over the source tree (8 separate greps would have roughly octupled the runtime on a large decompile). Output is a 3-line table: HTTP framework: Retrofit=N OkHttp=N Ktor=N Apollo=N Volley=N DI framework: Hilt/Dagger=N Koin=N Auth signals: Bearer=N HMAC/Sign=N A reader can immediately see which framework the app actually uses, whether auth is bearer-token or signed, and whether to spend time on a section or skip it. The summary is suppressed when a single section flag (--retrofit, --ktor, --paths, ...) is given, so the existing single-section workflows are unchanged. A reminder of the available section flags is printed below the counts so the agent does not have to consult --help.
…plate
Two small changes that together meaningfully reduce wasted effort:
1. Phase 3 now explicitly tells the agent to read every BuildConfig.java.
These files are almost never obfuscated and routinely contain the
single highest-signal constants in the APK — base URLs, flavor names,
build types, third-party API keys, feature flags. They were not
mentioned in the previous workflow despite being the cheapest possible
high-value target. One grep, finds them all.
2. The Phase 5 documentation template was a single per-endpoint block
asking for path params, query params, request body, response type,
and call chain. On apps with 100+ endpoints that easily becomes hours
of work for output the consumer will not read.
Replace it with two tiers:
* Tier 1 — flat table covering every endpoint (host, method, path,
auth required, source file). Always produced. Takes ~5 minutes
from the --paths output.
* Tier 2 — the existing detailed block, but explicitly reserved for
high-value endpoints: the entire auth flow, payment/checkout, and
anything the user specifically asked about. Default cap of ~10
Tier-2 entries unless asked for more.
This matches the natural shape of how analysts actually use this work
(one inventory table to know the surface area, plus a deep dive on
auth and a couple of flows) and prevents over-investment in detail
for endpoints nobody will read about.
Author
|
This changes were generated with Claude after few sessions of decompiling apps and doing "retro" to see what were biggest issues and time consuming tasks (or dead ends). I'm using this skill after changes locally and seems to work visibly better on my examples. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi! A batch of additive improvements I made while using this skill on a few real-world obfuscated Kotlin/KMP APKs. Each commit is self-contained and gated behind a new flag or a new file — nothing in the existing flow changes behavior unless explicitly opted into.
Phase 0 fingerprint script (
scripts/fingerprint.sh) — inspects an APK/XAPK in seconds and reports framework (Flutter / RN / Cordova / Xamarin / native), HTTP stack, DI, obfuscation level, notable SDKs, and merged native libs across split APKs. Saves a full jadx run when the app turns out to be non-native. Wired into SKILL.md as Phase 0.Kotlin name recovery (
scripts/recover-kotlin-names.sh+lookup-name.sh) — mines@DebugMetadataand@Metadataannotations (which R8 cannot strip, since the Kotlin runtime needs them) to rebuild an obf→real FQN mapping. Recovers ~100% of*Repository/*ViewModel/*UseCaseclasses on R8-stripped apps. Documented as optional Phase 3.5.Ktor and Apollo support in
find-api-calls.sh(--ktor,--apollo) — the old script covered only Retrofit / OkHttp / Volley and missed every modern Kotlin/KMP and GraphQL app.--pathsmode — greps for quoted path literals (which survive R8 inlining of call sites) with a segment-count regex and MIME/system-path denylist. On heavily obfuscated apps this is the only extraction technique that finds anything; recommended as the first step for any obfuscated target.Bucketed
--urlsoutput — strict URL regex (kills Kotlin-stdlib dictionary-fragment noise) plus a sidecarreferences/third_party_hosts.txtdenylist that splits output into likely-first-party vs third-party. The first-party host list is what the analyst actually wants on page 1.Koin DI + HMAC/request-signing patterns — Koin had no coverage despite being dominant in KMP; the
--authmode missed HMAC schemes (hardcoded HMAC secrets are a real security finding worth surfacing). Adds JCA primitives, common signature header names, and Ktor BearerTokens DSL.Summary header in
find-api-calls.sh— single-pass per-framework hit-count table at the top so the reader can see at a glance whether the app is Retrofit or Ktor, bearer or HMAC, before scrolling through thousands of file:line: matches. Suppressed when a single-section flag is given, so existing workflows are unchanged.Docs: BuildConfig.java callout + two-tier endpoint template —
BuildConfig.javais almost never obfuscated and routinely holds base URLs / API keys / flavor names; one grep, highest-signal target, was unmentioned. Phase 5's per-endpoint detail block is split into a Tier-1 inventory table (always produced, ~5 min from--paths) and a Tier-2 deep dive reserved for auth + payment + user-requested flows, capped at ~10 entries by default. Prevents over-investing detail on 100+ endpoints nobody reads.Happy to split this into smaller PRs if you'd prefer, or drop any commit you'd rather not take. PowerShell port of
fingerprint.shintentionally omitted — I can add if you'd like to keep parity withdecompile.ps1.